🧩 The Ultimate LLM Benchmark Collection
A collection of live benchmarks worth opening every time a new model is released, plus the ones you can stop wasting time on.
🌐 General (multi‑skill) leaderboards
SimpleBench — https://simple-bench.com/index.html
SOLO‑Bench — https://github.com/jd-3d/SOLOBench
AidanBench — https://aidanbench.com
SEAL by Scale (MultiChallenge) — https://scale.com/leaderboard
LMArena (Style Control) — https://beta.lmarena.ai/leaderboard
LiveBench — https://livebench.ai
ARC‑AGI — https://arcprize.org/leaderboard
Thematic Generalization (Lech Mazur) — https://github.com/lechmazur/generalization
Additional benchmarks by Lech Mazur:
Elimination Game — https://github.com/lechmazur/elimination_game
Confabulations — https://github.com/lechmazur/confabulations
EQBench (Longform Writing) — https://eqbench.com
Fiction‑Live Bench — https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
MC‑Bench (sort by win rate) — https://mcbench.ai/leaderboard
TrackingAI – IQ Bench — https://trackingai.org/home
Dubesor LLM Board — https://dubesor.de/benchtable.html
Balrog‑AI — https://balrogai.com
Misguided Attention — https://github.com/cpldcpu/MisguidedAttention
Snake‑Bench — https://snakebench.com
SmolAgents LLM Leaderboard (for GAIA & SimpleQA) — https://huggingface.co/spaces/smolagents/smolagents-leaderboard
Context‑Arena (MRCR, Graphwalks) — https://contextarena.ai
OpenCompass — https://rank.opencompass.org.cn/home
HHEM (Hallucination) — https://huggingface.co/spaces/vectara/leaderboard
🛠️ Coding / Math / Agentic
Aider‑Polyglot‑Coding — https://aider.chat/docs/leaderboards/
BigCodeBench — https://bigcode-bench.github.io
WebDev‑Arena — https://web.lmarena.ai/leaderboard
WeirdML — https://htihle.github.io/weirdml.html
Symflower Coding Eval v1.0 — https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/
PHYBench — https://phybench-official.github.io/phybench-demo/
MathArena — https://matharena.ai
Galileo Agent Leaderboard — https://huggingface.co/spaces/galileo-ai/agent-leaderboard
XLANG Agent Arena — https://arena.xlang.ai/leaderboard
🚀 For tracking AI take‑off
METR Long‑Task Benchmarks (вкл. RE Bench) — https://metr.org
PaperBench — https://openai.com/index/paperbench/
SWE‑Lancer — https://openai.com/index/swe-lancer/
MLE‑Bench — https://github.com/openai/mle-bench
SWE‑Bench — https://swebench.com
🏆 The essential "classic" set
GPQA‑Diamond — https://github.com/idavidrein/gpqa
SimpleQA — https://openai.com/index/introducing-simpleqa/
Tau‑Bench — https://github.com/sierra-research/tau-bench
SciCode — https://github.com/scicode-bench/SciCode
MMMU — https://mmmu-benchmark.github.io/#leaderboard
Humanity's Last Exam (HLE) — https://github.com/centerforaisafety/hle
🔍 Classic benchmark suites and aggregators
Simple‑Evals — https://github.com/openai/simple-evals
Vellum AI Leaderboard — https://vellum.ai/llm-leaderboard
Artificial Analysis — https://artificialanalysis.ai
⚠️ "Saturated" metrics you no longer need to check
MMLU, HumanEval, BBH, DROP, MGSM
Most purely mathematical datasets: GSM8K, MATH, AIME, ...
Models are already close to the ceiling on these, so they carry little signal.